---
title: Curse of Dimensionality
keywords: fastai
sidebar: home_sidebar
summary: "Example with code"
description: "Example with code"
nb_path: "nbs/07_concepts/curse_of_dimensionality.ipynb"
---

Distance vs. Dimensionality

  • The x-axis represents the Euclidean distance between a pair of points, measured after each dimension has been standardized
  • The y-axis represents the count (this is a histogram)
  • The play button increases the number of dimensions, causing the histogram to shift to the right. This means distances are increasing as dimensions go up!
  • The curse of dimensionality is that as the number of dimensions increases, the distances between points also increase. Every point ends up far from every other point, which makes it hard to group things together!
{% raw %}
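import numpy as np
import pandas as pd
import plotly.express as px

def generate_data(n, dim):
    """Return pairwise Euclidean distances for 2n standardized points in `dim` dimensions."""
    # Two Gaussian clusters of n points each, centered at 0 and at 3 in every dimension
    x = np.random.normal(0, 1, (n, dim))
    y = np.random.normal(3, 1, (n, dim))

    data = np.concatenate([x, y], axis=0)
    # Standardize each dimension to zero mean and unit variance
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Broadcasting builds the full (2n, 2n) matrix of pairwise distances
    pairwise = np.linalg.norm(normalized[:, None, :] - normalized[None, :, :], axis=2)
    # Keep each pair once and drop the zero self-distances on the diagonal
    distances = pairwise[np.triu_indices(len(normalized), k=1)]

    return distances

n = 100
dims = list(range(1, 10)) + list(range(10, 200, 10))

distances = {dim: generate_data(n, dim) for dim in dims}
# Long format: one row per (dims, distance) pair, so each value of dims becomes a frame
df = pd.DataFrame(distances).melt(var_name='dims', value_name='samples')
fig = px.histogram(df, x='samples', animation_frame='dims')

# At 190 dimensions the between-cluster distances reach about 25, so leave room for them
fig.update_xaxes(range=[0, 30])
fig.show()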
{% endraw %}
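
The rightward shift of the histogram can also be checked numerically. The sketch below is a simplification of the setup above: it assumes a single cloud of independent standard-normal points instead of two standardized clusters, and `mean_pairwise_distance` is a helper name introduced here for illustration. Under that assumption the expected squared distance between two points is exactly 2 × dim, so the mean distance should approach sqrt(2 × dim) as the number of dimensions grows.

```python
import numpy as np

def mean_pairwise_distance(n, dim, rng):
    """Mean Euclidean distance over all distinct pairs of n standard-normal points."""
    points = rng.normal(0, 1, (n, dim))
    # Broadcasting builds the full (n, n) matrix of pairwise distances
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Upper triangle only: each unordered pair once, no zero self-distances
    upper = np.triu_indices(n, k=1)
    return dists[upper].mean()

rng = np.random.default_rng(0)
for dim in (1, 10, 100):
    observed = mean_pairwise_distance(200, dim, rng)
    reference = np.sqrt(2 * dim)
    print(f"dim={dim:>3}  mean distance={observed:.2f}  sqrt(2*dim)={reference:.2f}")
```

The exact figures depend on the random seed, but the mean distance should grow with the number of dimensions and sit close to the square-root reference once the dimension is large.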

Required Samples vs. Dimensionality

As dimensionality increases, you also need more samples.

Consider the simple binary variable case.

  • For each variable, there are two choices.
  • For $k$ variables, we have $2^k$ choices.
  • For a linear increase in variables, you have an exponential increase in choices. The amount of data you need to collect increases exponentially!
  • Suppose you wanted 25 observations of each combination of variables. You would need $25 \cdot 2^k$ rows of data, which is already more than 400 million rows for $k = 24$ variables!
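
The counting argument in the list above can be checked by brute force for small $k$. This is a minimal sketch: `count_combinations` and `rows_required` are hypothetical helper names introduced here, and the default of 25 observations per combination is taken from the example in the text.

```python
from itertools import product

def count_combinations(k):
    """Count the distinct settings of k binary variables by enumerating them."""
    return sum(1 for _ in product([0, 1], repeat=k))

def rows_required(k, per_combination=25):
    """Rows needed to observe every combination `per_combination` times."""
    return per_combination * 2 ** k

for k in (1, 2, 3, 10):
    # The enumeration should agree with the closed form 2**k
    assert count_combinations(k) == 2 ** k
    print(f"k={k:>2}: {count_combinations(k):>5} combinations, {rows_required(k):>6} rows")

# Enumerating is impractical for large k, so use the closed form directly
print(f"k=24: {rows_required(24):,} rows")
```

Each added variable doubles the number of combinations, so the row count doubles as well.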
{% raw %}
import pandas as pd
import plotly.express as px

num_variables = list(range(1, 25))
# 25 observations for each of the 2**k combinations of k binary variables
rows_of_data_required = [25 * 2**k for k in num_variables]
df = pd.DataFrame(
    {'Rows of Data Required': rows_of_data_required,
     'Num of Binary Variables': num_variables}
).set_index('Num of Binary Variables')
fig = px.line(df, x=df.index, y='Rows of Data Required')
# Show the value at the hovered x position in a single unified tooltip
fig.update_layout(hovermode="x unified")
fig.show()
{% endraw %}